Non-redundant Subgroup Discovery in Large and Complex Data

نویسندگان

  • Matthijs van Leeuwen
  • Arno J. Knobbe
چکیده

Large and complex data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with Subgroup Discovery and its generalisation, Exceptional Model Mining. To address this, we introduce subgroup set mining: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these strategies in a beam search, the balance between exploration and exploitation is improved. Experiments clearly show that the proposed methods result in much more diverse subgroup sets than traditional Subgroup Discovery methods.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Discovering statistically non-redundant subgroups

The objective of subgroup discovery is to find groups of individuals who are statistically different from others in a large data set. Most existing measures of the quality of subgroups are intuitive and do not precisely capture statistical differences of a group with the other, and their discovered results contain many redundant subgroups. Odds ratio is a statistically sound measure to quantify...

متن کامل

ASIC Design of Butterfly Unit Based on Non-Redundant and Redundant Algorithm

Fast Fourier Transform (FFT) processors employed with pipeline architecture consist of series of Processing Elements (PE) or Butterfly Units (BU). BU or PE of FFT performs multiplication and addition on complex numbers. This paper proposes a single BU to compute radix-2, 8 point FFT in the time domain as well as frequency domain by replacing a series of PEs. This BU comprises of fused floating ...

متن کامل

Handling large databases in data mining

M. Mehdi Owrang O. American University, Dept of Computer Science & IS, Washington DC 20016 [email protected] ABSTRACT Current database technology involves processing a large volume of data in order to discover new knowledge. The high volume of data makes discovery process computationally expensive. In addition, real-world databases tend to be incomplete, redundant, and inconsistent that could...

متن کامل

Handling Large Databases in Data Mining

M. Mehdi Owrang O. American University, Dept of Computer Science & IS, Washington DC 20016 [email protected] ABSTRACT Current database technology involves processing a large volume of data in order to discover new knowledge. The high volume of data makes discovery process computationally expensive. In addition, real-world databases tend to be incomplete, redundant, and inconsistent that could...

متن کامل

The False Discovery Rate in Simultaneous Fisher and Adjusted Permutation Hypothesis Testing on Microarray Data

Background and Objectives: In recent years, new technologies have led to produce a large amount of data and in the field of biology, microarray technology has also dramatically developed. Meanwhile, the Fisher test is used to compare the control group with two or more experimental groups and also to detect the differentially expressed genes. In this study, the false discovery rate was investiga...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011